{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Training OpenAI GPT-2.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LH2YgC7LfzJZ",
        "colab_type": "text"
      },
      "source": [
        "#Training OpenAI GTP-2\n",
        "Copyright 2020, Denis Rothman MIT License. Denis Rothman created the Colab notebook using the OpenAI repository, adding title steps for educational purposes only.\n",
        "\n",
        "***Code References***\n",
        "\n",
        "[Reference: OpenAI Repository](https://github.com/openai/gpt-2)\n",
        "The repository was cloned and adapted to N Shepperd's repository.\n",
        "\n",
        "[Reference: N Shepperd Repository](https://github.com/nshepperd/gpt-2)\n",
        "The repository was not cloned. N Shepperd's training programs were inserted into the OpenAI Repository. The list of N Shepperd's programs are cited in the 'N Shepperd' section of the notebook. Some programs were modified for educational purposes only to work with this notebook.\n",
        "\n",
        "***Model Reference Paper***\n",
        "\n",
        "[Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever,2019,'Language Models are Unsupervised Multitask Learners'](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)\n",
        "\n",
        "\n",
        "***Step 1: Pre-requisites:***\n",
        "\n",
        "a) activate GPU in the notebook settings runTime menu <br>\n",
        "b) Upload the following program files and dset.txt(dataset) with the file manager: train.py,load_dataset.py,encode.py,accumulate,memory_saving_gradients.py,dset.txt"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "isqdu1fpfmqM",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 102
        },
        "outputId": "1d38b600-5f4a-4d66-a00c-f8cab8f5a158"
      },
      "source": [
        "#@title Step 2: Cloning the OpenAI GPT-2 Repository \n",
        "#!git clone https://github.com/nshepperd/gpt-2.git\n",
        "!git clone https://github.com/openai/gpt-2.git"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Cloning into 'gpt-2'...\n",
            "remote: Enumerating objects: 230, done.\u001b[K\n",
            "remote: Total 230 (delta 0), reused 0 (delta 0), pack-reused 230\u001b[K\n",
            "Receiving objects: 100% (230/230), 4.38 MiB | 6.13 MiB/s, done.\n",
            "Resolving deltas: 100% (119/119), done.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7RHOjN-TjUbj",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 3: Installing the requirements\n",
        "import os                     # when the VM restarts import os necessary\n",
        "os.chdir(\"/content/gpt-2\")    \n",
        "!pip3 install -r requirements.txt"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "q9vV73Opw68m",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "482b8e2a-4e62-437c-dcf1-1046c2d28a0a"
      },
      "source": [
        "!pip install toposort"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting toposort\n",
            "  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl\n",
            "Installing collected packages: toposort\n",
            "Successfully installed toposort-1.5\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_kpNCnh9fyYD",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        },
        "outputId": "4d140c05-569d-4dc6-a2ba-faaa883aac88"
      },
      "source": [
        "#@title Step 4: Checking TensorFlow version \n",
        "#Colab has tf 1.x and tf 2.x installed\n",
        "#Restart runtime using 'Runtime' -> 'Restart runtime...'\n",
        "%tensorflow_version 1.x\n",
        "import tensorflow as tf\n",
        "print(tf.__version__)"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "TensorFlow 1.x selected.\n",
            "1.15.2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jvVj0cLVkaPL",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 136
        },
        "outputId": "eb34c742-4323-45de-b5bb-bf403cbed7c8"
      },
      "source": [
        "#@title Step 5: Downloading 117M parameter GPT-2 Model\n",
        "# run code and send argument\n",
        "import os # after runtime is restarted\n",
        "os.chdir(\"/content/gpt-2\")\n",
        "!python3 download_model.py '117M' #creates model directory"
      ],
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\rFetching checkpoint:   0%|                                              | 0.00/77.0 [00:00<?, ?it/s]\rFetching checkpoint: 1.00kit [00:00, 874kit/s]                                                      \n",
            "\rFetching encoder.json:   0%|                                           | 0.00/1.04M [00:00<?, ?it/s]\rFetching encoder.json: 1.04Mit [00:00, 45.4Mit/s]                                                   \n",
            "Fetching hparams.json: 1.00kit [00:00, 779kit/s]                                                    \n",
            "Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 77.3Mit/s]                                  \n",
            "Fetching model.ckpt.index: 6.00kit [00:00, 4.30Mit/s]                                               \n",
            "Fetching model.ckpt.meta: 472kit [00:00, 54.3Mit/s]                                                 \n",
            "Fetching vocab.bpe: 457kit [00:00, 60.0Mit/s]                                                       \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aV5K8rvD1b-r",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 6: Copying the Project Resources to scr\n",
        "!cp /content/dset.txt /content/gpt-2/src/\n",
        "!cp -r /content/gpt-2/models/ /content/gpt-2/src/"
      ],
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dTUxDwtWlOLf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 7: Copying the N Shepperd Training Files\n",
        "#Referfence GitHub repository: https://github.com/nshepperd/gpt-2\n",
        "import os # import after runtime is restarted\n",
        "!cp /content/train.py /content/gpt-2/src/\n",
        "!cp /content/load_dataset.py /content/gpt-2/src/\n",
        "!cp /content/encode.py /content/gpt-2/src/\n",
        "!cp /content/accumulate.py /content/gpt-2/src/\n",
        "!cp /content/memory_saving_gradients.py /content/gpt-2/src/"
      ],
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "B6T2OrWoOvG0",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 68
        },
        "outputId": "12c9ede7-4ddc-4046-84d0-808ea04b4a81"
      },
      "source": [
        "#@title Step 8:Encoding dataset\n",
        "import os # import after runtime is restarted\n",
        "os.chdir(\"/content/gpt-2/src/\")\n",
        "model_name=\"117M\"\n",
        "!python /content/gpt-2/src/encode.py dset.txt out.npz "
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Reading files\n",
            "100% 1/1 [00:01<00:00,  1.27s/it]\n",
            "Writing out.npz\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UzlkNGbAkDBk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 9:Training the Model\n",
        "#Model saved after 1000 steps\n",
        "import os # import after runtime is restarted\n",
        "os.chdir(\"/content/gpt-2/src/\")\n",
        "!python train.py --dataset out.npz"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z-zAFd2hLQ2V",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 10: Creating a Training Model directory\n",
        "#Creating a Training Model directory named 'tgmodel'\n",
        "import os\n",
        "run_dir = '/content/gpt-2/models/tgmodel'\n",
        "if not os.path.exists(run_dir):\n",
        "  os.makedirs(run_dir)"
      ],
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-POx-g1Ql76C",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 10A: Copying training Files\n",
        "!cp /content/gpt-2/src/checkpoint/run1/model-1000.data-00000-of-00001 /content/gpt-2/models/tgmodel\n",
        "!cp /content/gpt-2/src/checkpoint/run1/checkpoint /content/gpt-2/models/tgmodel\n",
        "!cp /content/gpt-2/src/checkpoint/run1/model-1000.index /content/gpt-2/models/tgmodel\n",
        "!cp /content/gpt-2/src/checkpoint/run1/model-1000.meta /content/gpt-2/models/tgmodel"
      ],
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hdE9nNH8m7VD",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 10B: Copying the OpenAI GPT-2 117M Model files\n",
        "!cp /content/gpt-2/models/117M/encoder.json /content/gpt-2/models/tgmodel\n",
        "!cp /content/gpt-2/models/117M/hparams.json /content/gpt-2/models/tgmodel\n",
        "!cp /content/gpt-2/models/117M/vocab.bpe /content/gpt-2/models/tgmodel"
      ],
      "execution_count": 11,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3G8NOUXjMq4u",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "0a2bd956-a7d9-4c6d-dbbd-46fe493ed2a8"
      },
      "source": [
        "#@title Step 10C: Renaming the model directories\n",
        "import os\n",
        "!mv /content/gpt-2/models/117M  /content/gpt-2/models/117M_OpenAI\n",
        "!mv /content/gpt-2/models/tgmodel  /content/gpt-2/models/117M"
      ],
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "mv: cannot stat '/content/gpt-2/models/tgmodel': No such file or directory\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h3uexz_e4d18",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 11: Generating Unconditional Samples\n",
        "import os # import after runtime is restarted\n",
        "os.chdir(\"/content/gpt-2/src\")\n",
        "!python generate_unconditional_samples.py --model_name '117M'"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6HI7DuBK4iSU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Step 12: Interactive Context and Completion Examples\n",
        "import os # import after runtime is restarted\n",
        "os.chdir(\"/content/gpt-2/src\")\n",
        "!python interactive_conditional_samples.py --temperature 0.8 --top_k 40 --model_name '117M'"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}